k-combinations of a multiset

2022-02-08 · 5 min read

def: Multiset #

A multiset is just a set that may contain duplicates. For example, [a, a, b, b, b, c] is a multiset in a flattened, array-like representation. We could just as easily write it in a compact, map-like representation: { a: 2, b: 3, c: 1 }. Here, we take each unique element and associate it with its number of occurances in the multiset.

Multisets are a natural representation for many common objects in combinatorics, and are often used to model rolling multiple dice, counting cards by suite, and so on.

When representing multisets computationally, it's often convenient to represent them as a sorted array. That way, we'll have exactly one canonical representative for each multiset. This trick can enable further optimizations for us to exploit.

def: k-Combinations of a Multiset #

The k-combinations of a multiset are all k-length combinations drawn from the given multiset. Since these are combinations and not permutations, the order is not important (so the permutations [1, 1, 2, 2] and [1, 2, 1, 2] and [2, 2, 1, 1] and ... are all equivalent).

The elements are also drawn without replacement, so an element only appearing twice in the multiset should only ever appear at most twice in any combination.

To get a better idea of what we're talking about, suppose we've just rolled 6 dice and they come up as [1, 1, 2, 2, 2, 3]. How many different ways can we select 4 of the 6 dice to re-roll? As you can probably guess, these are just the 4-combinations of the above multiset.

Listing out the 4-combinations of mset = [1, 1, 2, 2, 2, 3] in lexicographic order:

[1, 1, 2, 2]
[1, 1, 2, 3]
[1, 2, 2, 2]
[1, 2, 2, 3]
[2, 2, 2, 3]

Insight #

If you stare at the list above for a bit, you might notice something interesting. If we list out the combinations in sorted order, the first combination is always mset[0..k] and the last combination is always mset[n-k..n], where mset the multiset, k is the length of each combination, n is the number of elements in the multiset mset, and a..b is Rust range syntax for integers between a (inclusive) and b (exclusive).

This property will come in handy in a moment when we write an algorithm to generate the k-combinations, since these two k-combinations give us our start and end points. In other words, we'll start generating combinations from mset[0..k] and we'll know we've finished when the current combination equals mset[n-k..n]

I'll leave the general proof as an exercise for the reader : )

Algorithm #

Here is the core function, which takes mset, a sorted multiset slice, and arr the previous generated combination. We modify arr in-place to the next combination.

If there are no combinations left to generate, the function returns true.

fn next_mset_comb<T>(
	// this is the original multiset in flattened,
	// array-like representation
	mset: &[T],
	// this is the _previous_ combination, also in flattened
	// array-like representation
	arr: &mut [T],
) -> bool
where
    T: Clone + PartialOrd,
{
	let n = mset.len();
	let k = arr.len();

	let last = &mset[(n-k)..n];

	// find the rightmost element in `arr` that is less
	// than the "max" value it can have (which is the
	// same as the element in the final combination at the
	// same index)
	let maybe_next_idx = (&*arr).into_iter()
		.zip(last.into_iter())
		.rposition(|(x, l)| x < l);

	let (i, x) = match maybe_next_idx {
		Some(i) => (i, &arr[i]),
		None => return true, // done!
	};

	// find the successor element (the next element in the
	// multiset greater than arr[i])
	let maybe_succ_idx = mset.into_iter().position(|y| x < y);

	let j = match maybe_succ_idx {
		Some(j) => j,
		// since we've already checked we're not at the end,
		// it cannot happen that there is no successor.
		None => unreachable!(),
	};

	// replace arr[i] with its successor and all subsequent in arr with
	// the subsequent in the original multiset.
	arr[i..k].clone_from_slice(&mset[j..(j + (k - i))]);

	// not done yet : )
	false
}

We can also use the above to write an Iterator over the k-combinations. This Iterator has a nice advantage in that it only uses constant space while iterating, so no need for a big temporary array.

struct KCombsIter<T> {
    mset: Vec<T>,
    comb: Vec<T>,
    done: bool,
}

impl<T> KCombsIter<T>
where
    T: Clone + PartialOrd + Ord,
{
    fn new(mset: Vec<T>, k: usize) -> Self {
        assert!(k <= mset.len());
        assert!(is_sorted(mset.iter()));

        let comb = mset[0..k].to_vec();

        Self {
            mset,
            comb,
            done: false,
        }
    }
}

impl<T> Iterator for KCombsIter<T>
where
    T: Clone + PartialOrd + Ord,
{
    type Item = Vec<T>;

    fn next(&mut self) -> Option<Self::Item> {
        if !self.done {
            let comb = self.comb.clone();
            self.done = next_mset_comb(&self.mset, &mut self.comb);
            Some(comb)
        } else {
            None
        }
    }
}

See the complete code in action on the Rust Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6d08a5fa898eb5e43f9a0f7d30fdd0ad

The above may be slightly wasteful for certain usecases, since it needs to allocate on each iteration. For your specific usage, you'll want to specialize the representation of the multiset itself in order to squeeze out every last drop of perf.

For example, in my Kingdome Come: Deliverance - Optimal Dice Strategy algorithm, you can see the bit-packed DiceVec, which represents a roll of 0 to 6 dice packed into a single u64 and the likewise specialized iterator DiceVecMultisetsIter, which uses SIMD to manipulate the multisets and doesn't need any intermediate allocation.